Abstract


Binary embedding is a nonlinear dimension reduction methodology where high dimensional
data are embedded into the Hamming cube while preserving the structure of the original space.
Specifically, for an arbitrary $N$ distinct points in $\mathbb{S}^{p-1}$, our goal is to encode each point using
$m$-dimensional binary strings such that we can reconstruct their geodesic distance up to $\delta$ uniform
distortion. Existing binary embedding algorithms either lack theoretical guarantees or suffer
from running time $O(mp)$. We make three contributions: (1) we establish a lower bound that
shows any binary embedding oblivious to the set of points requires $m = \Omega(\frac{1}{\delta^2}\log N)$ bits and
a similar lower bound for non-oblivious embeddings into Hamming distance; (2) we propose a
novel fast binary embedding algorithm with provably optimal bit complexity $m = O(\frac{1}{\delta^2}\log N)$
and near linear running time $O(p \log p)$ whenever $\log N \ll \delta\sqrt{p}$, with a slightly worse running
time for larger $\log N$; (3) we also provide an analytic result about embedding a general set
of points $K \subseteq \mathbb{S}^{p-1}$ with even infinite size. Our theoretical findings are supported through
experiments on both synthetic and real data sets.

1 Introduction 

Low distortion embeddings that transform high-dimensional points to low-dimensional space have
played an important role in dealing with storage, information retrieval and machine learning
problems for modern datasets. Perhaps one of the most famous results along these lines is the
Johnson-Lindenstrauss (JL) lemma Johnson and Lindenstrauss (1984), which shows that $N$ points can be
embedded into a $O(\delta^{-2}\log N)$-dimensional space while preserving pairwise Euclidean distance up to
$\delta$-Lipschitz distortion. This dependence has been shown to be information-theoretically optimal
Alon (2003). Significant work has focused on fast algorithms for computing the embeddings, e.g.,
(Ailon and Chazelle, 2006; Krahmer and Ward, 2011; Ailon and Liberty, 2013; Cheraghchi et al.,
2013; Nelson et al., 2014).

More recently, there has been a growing interest in designing binary codes for high dimensional
points with low distortion, i.e., embeddings into the binary cube (Weiss et al., 2009; Raginsky and Lazebnik,
2009; Salakhutdinov and Hinton, 2009; Liu et al., 2011; Gong and Lazebnik, 2011; Yu et al., 2014).
Compared to JL embedding, embedding into the binary cube (also called binary embedding) has
two advantages in practice: (i) As each data point is represented by a binary code, the disk size
for storing the entire dataset is reduced considerably. (ii) Distance in the binary cube is some function
of the Hamming distance, which can be computed quickly using computationally efficient bit-wise
operators. As a consequence, binary embedding can be applied to a large number of domains such
as biology, finance and computer vision where the data are usually high dimensional.

While most JL embeddings are linear maps, any binary embedding is fundamentally a nonlinear 
transformation. As we detail below, this nonlinearity poses significant new technical challenges for 
both upper and lower bounds. In particular, our understanding of the landscape is significantly less 
complete. To the best of our knowledge, lower bounds are not known; embedding algorithms for 
infinite sets have a dependence on the distortion $\delta$ significantly exceeding their finite-set counterparts; and
perhaps most significantly, there are no fast (near linear-time) embedding algorithms with strong 
performance guarantees. As we explain below, this paper contributes to each of these three areas. 

First, we detail some recent work and state of the art results. 

Recent Work. A common approach, pursued by several existing works, considers the natural
extension of JL embedding techniques via one bit quantization of the projections:

$$b(x) = \mathrm{sign}(Ax), \tag{1.1}$$

where $x \in \mathbb{R}^p$ is the input data point, $A \in \mathbb{R}^{m \times p}$ is a projection matrix and $b(x)$ is the embedded binary
code. In particular, Jacques et al. (2011) shows that when each entry of $A$ is generated independently
from $\mathcal{N}(0,1)$, with $m > \frac{1}{\delta^2}\log N$ it achieves, with high probability, at most $\delta$ (additive) distortion
for $N$ points. Work in Plan and Vershynin (2014) extends these results to arbitrary sets $K \subseteq \mathbb{S}^{p-1}$
where $|K|$ can be infinite. They prove that the embedding with $\delta$-distortion can be obtained
when $m \gtrsim w(K)^2/\delta^6$, where $w(K)$ is the Gaussian Mean Width of $K$. It is unknown whether the
unusual dependence $\delta^{-6}$ is optimal or not. Despite provable sample complexity guarantees, one bit
quantization of random projection as in (1.1) suffers from $O(mp)$ running time for a single point.

This quadratic dependence can result in a prohibitive computational cost for high-dimensional data.
Analogously to the developments in “fast” JL embeddings, there are several algorithms proposed
to overcome this computational issue. Work in Gong et al. (2013) proposes a bilinear projection
method. By setting $m = O(p)$, their method reduces the running time from $O(p^2)$ to $O(p^{1.5})$. More
recently, work in Yu et al. (2014) introduces a circulant random projection algorithm that requires
running time $O(p \log p)$. While these algorithms have reduced running time, as of yet they come
without performance guarantees: to the best of our knowledge, the measurement complexities of
the two algorithms are still unknown. Another line of work considers learning binary codes from
data by solving certain optimization problems (Weiss et al., 2009; Salakhutdinov and Hinton, 2009;
Norouzi et al., 2012; Yu et al., 2014). Unfortunately, there is no known provable bit complexity
result for these algorithms. It is also worth noting that Raginsky and Lazebnik (2009) provide
a binary code design for preserving shift-invariant kernels. Their method suffers from the same
quadratic computational issue as the fully random Gaussian projection method.




Another related dimension reduction technique is locality sensitive hashing (LSH) where the 
goal is to compute a discrete data structure such that similar points are mapped into the same 
bucket with high probability (see, e.g., Andoni and Indyk (2006)). The key difference is that LSH 
preserves short distances, but binary embedding preserves both short and far distances. For points 
that are far apart, LSH only cares that the hashings are different while binary embedding cares how 
different they are. 

Contributions of this paper. In this paper, we address several unanswered problems about 
binary embedding. We provide lower bounds for both data-oblivious and data-aware embeddings; 
we provide a fast algorithm for binary embedding; and finally we consider the setting of infinite 
sets, and prove that in some of the most common cases we can improve the state-of-the-art sample 
complexity guarantees by a factor of $\delta^2$.

1. We provide two lower bounds for binary embeddings. The first shows that any method for
embedding and for recovering a distance estimate from the embedded points that is independent
of the data being embedded must use $\Omega(\frac{1}{\delta^2}\log N)$ bits. This is based on a bound on
the communication complexity of Hamming distance used by Jayram and Woodruff (2013) for
a lower bound on the “distributional” JL embedding. Separately, we give a lower bound for
arbitrarily data-dependent methods that embed into (any function of) the Hamming distance,
showing such algorithms require $m = \Omega(\frac{1}{\delta^2 \log(1/\delta)}\log N)$. This bound is similar to Alon (2003)
which gets the same result for JL, but the binary embedding requires a different construction.

2. We provide the first provable fast algorithm with optimal measurement complexity $O(\frac{1}{\delta^2}\log N)$.
The proposed algorithm has running time $O(\frac{1}{\delta^2}\log\frac{1}{\delta}\log^2 N \log p \log^3(\log N) + p \log p)$ and thus has
almost linear time complexity when $\log N \lesssim \delta\sqrt{p}$. Our algorithm is based on two key novel
ideas. First, our similarity is based on the median Hamming distance of sub-blocks of the binary
code; second, our new embedding takes advantage of a pair-wise independence argument
for Gaussian Toeplitz projection that could be of independent interest.

3. For arbitrary set $K \subseteq \mathbb{S}^{p-1}$ and the fully random Gaussian projection algorithm, we prove
that $m = O(w(K^+)^2/\delta^4)$ is sufficient to achieve $\delta$ uniform distortion. Here $K^+$ is an expanded
set of $K$. Although in general $K \subseteq K^+$ and hence $w(K) \le w(K^+)$, for interesting $K$ such
as sparse or low rank sets, one can show $w(K^+) = \Theta(w(K))$. Therefore applying our
theory to these sets results in an improved dependence on $\delta$ compared to a recent result in
Plan and Vershynin (2014). See Section 3.3 for a detailed discussion.

Discussion. For the fast binary embedding, one simple solution, to the best of our knowledge not
previously stated, is to combine a Gaussian projection and the well known results about fast JL. In
detail, consider the strategy $b(x) = \mathrm{sign}(AFx)$, where $A$ is a Gaussian matrix and $F$ is any fast JL
construction such as subsampled Walsh-Hadamard matrix Rudelson and Vershynin (2008) or partial
circulant matrix Krahmer et al. (2014) with column flips. A simple analysis shows that this approach
achieves measurement complexity $O(\frac{1}{\delta^2}\log N)$ and running time $O(\frac{1}{\delta^4}\log^2 N \log p \log^3(\log N) + p \log p)$
by following the best known fast JL results. Our fast binary embedding algorithm builds on this
simple but effective idea. Instead of using a Gaussian matrix after the fast JL transform, we
use a series of Gaussian Toeplitz matrices that have fast matrix vector multiplication. This novel
construction improves the running time by a factor of $\delta^2$ while keeping measurement complexity the same. In
order for this to work, we need to change the estimator from straight Hamming distance to one
based on the median of several Hamming distances.

An interesting point of comparison is Ailon and Rauhut (2014), which considers “RIP-optimal”
distributions that give JL embeddings with optimal measurement complexity $O(\frac{1}{\delta^2}\log N)$ and running
time $O(p \log p)$. They show the existence of such embeddings whenever $\log N < \delta^2 p^{1/2-\gamma}$ for
any constant $\gamma > 0$, which is essentially no better than the bound given by the folklore method
of composing a Gaussian projection with a subsampled Fourier matrix. In our binary setting, we
show how to improve the region of optimality by a factor of $\delta$. It would be interesting to try and
translate this result back to the JL setting.

Notation. We use $[n]$ to denote the natural number set $\{1, 2, \ldots, n\}$. For natural numbers $a < b$, let
$[a, b]$ denote the consecutive set $\{a, a+1, \ldots, b\}$. A vector in $\mathbb{R}^n$ is denoted as $x$ or equivalently
$(x_1, x_2, \ldots, x_n)^\top$. We use $x_{\mathcal{I}}$ to denote the sub-vector of $x$ with index set $\mathcal{I} \subseteq [n]$. We denote
entry-wise vector multiplication as $x \odot y = (x_1 y_1, x_2 y_2, \ldots, x_n y_n)^\top$. A matrix is typically denoted
as $M$. Term $(i,j)$ of $M$ is denoted as $M_{i,j}$. Row $i$ of $M$ is denoted as $M_i$. An $n$-by-$n$ identity
matrix is denoted as $I_n$. For two random variables $X, Y$, we denote the statement that $X$ and $Y$
are independent as $X \perp Y$. For two binary strings $a, b \in \{0,1\}^m$, we use $d_{\mathcal{H}}(a,b)$ to denote the
normalized Hamming distance, i.e., $d_{\mathcal{H}}(a,b) := \frac{1}{m}\sum_{i=1}^m \mathbb{1}(a_i \neq b_i)$.

2 Organization, Problem Setup and Preliminaries 

In this section, we state our problem formally, give some key definitions and present a simple
(known) algorithm that sets the stage for the main results of this paper. The algorithm (Algorithm
1), discussed in detail below, is simply the one-bit quantization of a standard JL embedding. Its
performance on finite sets is easy to analyze, and we state it in Proposition 2.2 below. Three
important questions remain unanswered: (i) Lower Bounds - is the performance guaranteed by
Proposition 2.2 optimal? We answer this affirmatively in Section 3.1. (ii) Fast Embedding - whereas
Algorithm 1 is quadratic (depending on the product $mp$), fast JL algorithms are nearly linear in $p$;
does something similar exist for binary embedding? We develop a new algorithm in Section 3.2 that
addresses the complexity issue, while at the same time guaranteeing $\delta$-embedding with dimension
scaling that matches our lower bound. Interestingly, a key aspect of our contribution is that we use
a slightly modified similarity function, using the median of the normalized Hamming distance on
sub-blocks. (iii) Infinite Sets - recent work analyzing the setting of infinite sets $K \subseteq \mathbb{S}^{p-1}$ shows a
dependence of $\delta^{-6}$ on the distortion. Is this optimal? We show in Section 3.3 that in many settings
this can be improved by a factor of $\delta^2$. In Section 4, we provide numerical results. We give most
proofs in Section 5.

2.1 Problem Setup 

Given a set of $p$-dimensional points, our goal is to find a transformation $f : \mathbb{R}^p \to \{0,1\}^m$ such
that the Hamming distance (or other related, easily computable metric) between two binary codes
is close to their similarity in the original space. We consider points on the unit sphere $\mathbb{S}^{p-1}$ and use
the normalized geodesic distance (occasionally, and somewhat misleadingly, called cosine similarity)
as the input space similarity metric. For two points $x, y \in \mathbb{R}^p$, we use $d(x,y)$ to denote the geodesic
distance, defined as

$$d(x,y) := \frac{\angle\big(x/\|x\|_2,\; y/\|y\|_2\big)}{\pi},$$

where $\angle(\cdot,\cdot)$ denotes the angle between two vectors. For $x, y \in \mathbb{S}^{p-1}$, the metric $d(x,y)$ is proportional
to the length of the shortest path connecting $x, y$ on the sphere.
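
As a concrete reading of the definition above, the following minimal Python sketch computes $d(x,y)$ with NumPy. The function name `geodesic_distance` is an illustrative choice and does not come from the paper.

```python
import numpy as np

def geodesic_distance(x, y):
    """Normalized geodesic distance d(x, y) = angle(x, y) / pi, a value in [0, 1].

    The inputs need not be unit vectors; they are normalized first,
    matching the definition d(x, y) = angle(x/||x||, y/||y||) / pi.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip guards against floating-point overshoot slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)
```

Orthogonal vectors are at distance 1/2 and antipodal vectors at distance 1, so the metric takes values in $[0,1]$, the same range as the normalized Hamming distance.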

Given the success of JL embedding, a natural approach is to consider the one bit quantization
of a random projection:

$$b = \mathrm{sign}(Ax), \tag{2.1}$$

where $A$ is some random projection matrix. Given two points $x, y$ with embedding vectors $b$ and
$c$, we have $b_i \neq c_i$ if and only if $\langle A_i, x\rangle \langle A_i, y\rangle < 0$. The traditional metric in the embedded space
has been the so-called normalized Hamming distance, which we denote by $d_A(x,y)$ and define as
follows:

$$d_A(x,y) := \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\big(\mathrm{sign}(\langle A_i, x\rangle) \neq \mathrm{sign}(\langle A_i, y\rangle)\big). \tag{2.2}$$

Definition 2.1. ($\delta$-uniform Embedding) Given a set $K \subseteq \mathbb{S}^{p-1}$ and projection matrix $A \in \mathbb{R}^{m \times p}$,
we say the embedding $b = \mathrm{sign}(Ax)$ provides a $\delta$-uniform embedding for points in $K$ if

$$|d_A(x,y) - d(x,y)| \le \delta, \quad \forall\, x, y \in K. \tag{2.3}$$

Note that unlike for JL, we aim to control additive error instead of relative error. Due to 
the inherently limited resolution of binary embedding, controlling relative error would force the 
embedding dimension m to scale inversely with the minimum distance of the original points, and 
in particular would be impossible for any infinite set. 


2.2 Uniform Random Projection 


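Algorithm 1 Uniform Random Projection

input Finite number of points $K = \{x_i\}_{i=1}^{|K|}$ where $K \subseteq \mathbb{S}^{p-1}$, embedding target dimension $m$.
1: Construct matrix $A \in \mathbb{R}^{m \times p}$ where each entry $A_{i,j}$ is drawn independently from $\mathcal{N}(0,1)$.
2: for $i = 1, 2, \ldots, |K|$ do
3:   $b_i \leftarrow \mathrm{sign}(A x_i)$.
4: end for
output $\{b_i\}_{i=1}^{|K|}$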

Algorithm 1 presents (2.1) formally, when $A$ is an i.i.d. Gaussian random matrix, i.e., $A_i \sim
\mathcal{N}(0, I_p)$ for any $i \in [m]$. It is easy to observe that for two fixed points $x, y \in \mathbb{S}^{p-1}$, we have

$$\mathbb{E}\Big[\mathbb{1}\big(\mathrm{sign}(\langle A_i, x\rangle) \neq \mathrm{sign}(\langle A_i, y\rangle)\big)\Big] = d(x,y), \quad \forall\, i \in [m]. \tag{2.4}$$

The above equality has a geometric explanation: each $A_i$ actually represents a uniformly distributed
random hyperplane in $\mathbb{R}^p$. Then $\mathrm{sign}(\langle A_i, x\rangle) \neq \mathrm{sign}(\langle A_i, y\rangle)$ holds if and only if hyperplane $A_i$
intersects the arc between $x$ and $y$. In fact, $d_A(x,y)$ is equal to the fraction of such hyperplanes.
Under such uniform tessellation, the probability with which the aforementioned event occurs is
$d(x,y)$. Applying Hoeffding’s inequality and a union bound over pairs of points, we
have the following straightforward guarantee.

Proposition 2.2. Given a set $K \subseteq \mathbb{S}^{p-1}$ with finite size $|K|$, consider Algorithm 1 with $m \ge
c(1/\delta^2)\log|K|$. Then with probability at least $1 - 2\exp(-\delta^2 m)$, we have

$$|d_A(x,y) - d(x,y)| \le \delta, \quad \forall\, x, y \in K.$$

Here $c$ is some absolute constant.

Proof. The proof idea is standard and follows from the above; we omit the details. □
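
To make Algorithm 1 and Proposition 2.2 concrete, here is a minimal simulation sketch in Python. It is not the authors' experimental code: the function names, the seed and the problem sizes are illustrative assumptions, and bits are stored as booleans (`True` for a non-negative projection) in place of $\pm 1$ signs.

```python
import numpy as np

def uniform_random_projection(X, m, rng):
    """Sketch of Algorithm 1. X has shape (N, p) with unit-norm rows.
    Returns an (N, m) boolean code matrix."""
    A = rng.standard_normal((m, X.shape[1]))  # rows A_i ~ N(0, I_p)
    return (X @ A.T) >= 0.0                   # boolean stand-in for sign(A x)

def max_distortion(X, codes):
    """Largest |d_A(x, y) - d(x, y)| over all pairs of rows of X."""
    d_geo = np.arccos(np.clip(X @ X.T, -1.0, 1.0)) / np.pi
    C = codes.astype(float)
    # Number of agreeing bits per pair: both True, or both False.
    agree = C @ C.T + (1.0 - C) @ (1.0 - C).T
    d_ham = 1.0 - agree / codes.shape[1]
    return float(np.max(np.abs(d_ham - d_geo)))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # project onto the unit sphere
for m in (256, 1024, 4096):
    codes = uniform_random_projection(X, m, rng)
    print(m, max_distortion(X, codes))
```

Under the proposition, the printed distortion should shrink roughly like $1/\sqrt{m}$ as $m$ grows; the exact values depend on the random draw.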

3 Main Results 

We now present our main results on lower bounds, on fast binary embedding, and finally, on a 
general result for infinite sets. 

3.1 Lower Bounds 

We offer two different lower bounds. The first shows that any embedding technique that is oblivious
to the input points must use $\Omega(\frac{1}{\delta^2}\log N)$ bits, regardless of what method is used to estimate
geodesic distance from the embeddings. This shows that uniform random projection and our fast
binary embedding achieve optimal bit complexity (up to constants). The bound follows from results
by Jayram and Woodruff (2013) on the communication complexity of Hamming distance.

Theorem 3.1. Consider any distribution on embedding functions $f : \mathbb{R}^p \to \{0,1\}^m$ and reconstruction
algorithms $g : \{0,1\}^m \times \{0,1\}^m \to \mathbb{R}$ such that for any $x_1, \ldots, x_N \in \mathbb{S}^{p-1}$ we have

$$|g(f(x_i), f(x_j)) - d(x_i, x_j)| \le \delta$$

for all $i, j \in [N]$ with probability $1 - \epsilon$. Then $m = \Omega(\frac{1}{\delta^2}\log(N/\epsilon))$.

Proof. See Section 5.1 for detailed proof. □

One could imagine, however, that an embedding could use knowledge of the input point set to
embed any specific set of points into a lower-dimensional space than is possible with an oblivious
algorithm. In the Johnson-Lindenstrauss setting, Alon (2003) showed that this is not possible
beyond (possibly) a $\log(1/\delta)$ factor. We show the analogous result for binary embeddings. Relative
to Theorem 3.1, our second lower bound works for data-dependent embedding functions but loses a
$\log(1/\delta)$ factor and requires the reconstruction function to depend only on the Hamming distance between
the two strings. This restriction is natural because an unrestricted data-dependent reconstruction
function could simply encode the answers and avoid any dependence on $\delta$.

With the scheme given in (2.1), choosing $A$ as a fully random Gaussian matrix yields $d_A(x,y) \approx
d(x,y)$. However, an arbitrary binary embedding algorithm may not yield a linear functional
relationship between Hamming distance and geodesic distance. Thus for this lower bound, we allow
the design of an algorithm with arbitrary link function $\mathcal{L}$.


Definition 3.2. (Data-dependent binary embedding problem)

Let $\mathcal{L} : [0,1] \to [0,1]$ be a monotonic and continuous function. Given a set of points $x_1, x_2, \ldots, x_N \in \mathbb{S}^{p-1}$,
we say a binary embedding mapping $f$ solves the binary embedding problem in terms of link
function $\mathcal{L}$, if

$$|d_{\mathcal{H}}(f(x_i), f(x_j)) - \mathcal{L}(d(x_i, x_j))| \le \delta, \quad \forall\, i, j \in [N]. \tag{3.1}$$

Although the choice of $\mathcal{L}$ is flexible, note that for the same point, we always have $d_{\mathcal{H}}(f(x_i), f(x_i)) =
d(x_i, x_i) = 0$, thus (3.1) implies $\mathcal{L}(0) \le \delta$. We can just let $\mathcal{L}(0) = 0$. In particular, we let
$\mathcal{L}_{\max} = \mathcal{L}(1)$. We have the following lower bound:

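Theorem 3.3. There exist $2N$ points $x_1, x_2, \ldots, x_{2N} \in \mathbb{S}^{N-1}$ such that for any binary embedding
algorithm $f$ on $\{x_i\}_{i=1}^{2N}$, if it solves the data-dependent binary embedding problem defined in 3.2 in
terms of link function $\mathcal{L}$ and any $\delta \in (0, \frac{1}{16\sqrt{e}}\mathcal{L}_{\max})$, it must satisfy

$$m \ge \frac{1}{128e}\left(\frac{\mathcal{L}_{\max}}{\delta}\right)^2 \frac{\log N}{\log\frac{\mathcal{L}_{\max}}{2\delta}}. \tag{3.2}$$

Proof. See Section 5.2 for detailed proof. □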


Remark 3.4. We make two remarks for the above result. (1) When $\mathcal{L}_{\max}$ is some constant, our
result implies that for general $N$ points, any binary embedding algorithm (even data-dependent)
must have $\Omega(\frac{1}{\delta^2 \log(1/\delta)}\log N)$ measurements. This is analogous to Alon’s lower bound
in the JL setting. It is worth highlighting two differences: (i) The JL setting considers the same
metric (Euclidean distance) for both the input and the embedded spaces. In binary embedding,
however, we are interested in showing the relationship between Hamming distance and geodesic
distance. (ii) Our lower bound is applicable to a broader class of binary embedding algorithms as it
involves an arbitrary, even data-dependent, link function $\mathcal{L}$. Such an extension is not considered in the
lower bound of JL. (2) The stated lower bound only depends on $\mathcal{L}_{\max}$ and does not depend on any
curvature information of $\mathcal{L}$. The constraint $\mathcal{L}_{\max} > 16\sqrt{e}\,\delta$ is critical for our lower bound to hold,
but some such restriction is necessary because for $\mathcal{L}_{\max} < \delta$ we are able to embed all points into
just one bit. In this case $d_{\mathcal{H}}(f(x_i), f(x_j)) = 0$ for all pairs and condition (3.1) would hold trivially.


3.2 Fast Binary Embedding 

In this section, we present a novel fast binary embedding algorithm. We then establish its theoretical 
guarantees. There are two key ideas that we leverage: (i) instead of normalized Hamming distance, 
we use a related metric, the median of the normalized Hamming distance applied to sub-blocks; 
and (ii) we show a key pair-wise independence lemma for partial Gaussian Toeplitz projection, that 
allows us to use a concentration bound that then implies nearness in the median-metric we use. 
3.2.1 Method


Our algorithm builds on sub-sampled Walsh-Hadamard matrix and partial Gaussian Toeplitz matrices
with random column flips. In particular, an $m$-by-$p$ partial Walsh-Hadamard matrix has the
form

$$\Phi := P \cdot H \cdot D. \tag{3.3}$$

The above construction has three components. We characterize each term as follows:

• Term $D$ is a $p$-by-$p$ diagonal matrix with diagonal terms $\{\zeta_i\}_{i=1}^{p}$ that are drawn from an i.i.d.
Rademacher sequence, i.e., for any $i \in [p]$, $\Pr(\zeta_i = 1) = \Pr(\zeta_i = -1) = 1/2$.

• Term $H$ is a $p$-by-$p$ scaled Walsh-Hadamard matrix such that $H^\top H = I_p$.

• Term $P$ is an $m$-by-$p$ sparse matrix where one entry of each row is set to be 1 while the rest
are 0. The nonzero coordinate of each row is drawn independently from the uniform distribution.
In fact, the role of $P$ is to randomly select $m$ rows of $H \cdot D$.

An $m$-by-$n$ partial Gaussian Toeplitz matrix has the form

$$\Psi := P \cdot T \cdot D. \tag{3.4}$$

We introduce each term as follows:

• Term $D$ is an $n$-by-$n$ diagonal matrix with diagonal terms $\{\zeta_i\}_{i=1}^{n}$ that are drawn from an i.i.d.
Rademacher sequence.

• Term $T$ is an $n$-by-$n$ Toeplitz matrix constructed from a $(2n-1)$-dimensional vector $g$ such that
$T_{i,j} = g_{i-j+n}$ for any $i, j \in [n]$. In particular, $g$ is drawn from $\mathcal{N}(0, I_{2n-1})$.

• Term $P$ is an $m$-by-$n$ sparse matrix where $P_i = e_i$ for any $i \in [m]$. Equivalently, we use $P$ to
select the first $m$ rows of $TD$. It is worth noting that we actually only need to select any $m$ distinct
rows.
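
The product $\Psi x = P\,T\,D\,x$ in (3.4) can be evaluated without forming $T$, because multiplying by a Toeplitz matrix is a slice of a linear convolution. The following Python sketch illustrates this with NumPy's FFT and checks it against the dense matrix. The helper names are hypothetical, indices are 0-based (so $T_{i,j} = g_{i-j+n-1}$), and this simple version uses one FFT of length $3n-2$, so it costs $O(n \log n)$ rather than any sharper bound.

```python
import numpy as np

def toeplitz_dense(g, n):
    """Dense n-by-n Toeplitz matrix with T[i, j] = g[i - j + n - 1] (0-based)."""
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + (n - 1)
    return g[idx]

def partial_toeplitz_apply(g, zeta, x, m):
    """Compute the first m entries of T @ (zeta * x) via FFT.

    (T v)[i] = sum_j g[i - j + n - 1] v[j] is entry i + n - 1 of the linear
    convolution of g and v, so T v is the slice [n-1 : 2n-1] of conv(g, v).
    """
    n = x.shape[0]
    L = 3 * n - 2  # full length of conv(g, v): (2n - 1) + n - 1
    conv = np.fft.irfft(np.fft.rfft(g, L) * np.fft.rfft(zeta * x, L), L)
    return conv[n - 1 : n - 1 + m]

rng = np.random.default_rng(0)
n, m = 16, 5
g = rng.standard_normal(2 * n - 1)
zeta = rng.choice([-1.0, 1.0], size=n)   # Rademacher column flips (matrix D)
x = rng.standard_normal(n)
fast = partial_toeplitz_apply(g, zeta, x, m)
dense = (toeplitz_dense(g, n) @ (zeta * x))[:m]
print(np.allclose(fast, dense))
```

Zero-padding both sequences to length $3n-2$ makes the circular convolution computed by the FFT coincide with the linear one, which is why the slice matches the dense product.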


With the above constructions in hand, we present our fast algorithm in Algorithm 2. At a high
level, Algorithm 2 consists of two parts: First, we apply a column flipped partial Hadamard transform
to convert a $p$-dimensional point into an $n$-dimensional intermediate point. Second, we use $B$ independent
$(m/B)$-by-$n$ partial Gaussian Toeplitz matrices and the sign operator to map an intermediate point
into $B$ blocks of binary codes. In terms of similarity computation for the embedded codes, we use
the median of each block’s normalized Hamming distance. In detail, for $b, c \in \{0,1\}^m$, the $B$-wise
normalized Hamming distance is defined as

$$d_{\mathcal{H}}(b, c; B) := \mathrm{median}\Big(\big\{ d_{\mathcal{H}}(b_{T_i}, c_{T_i}) \big\}_{i=0}^{B-1}\Big), \tag{3.5}$$

where $T_i = [\,i \cdot m/B + 1,\; (i+1) \cdot m/B\,]$ is the index set of the $i$-th block.
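
The block-wise median in (3.5) is simple to compute. The sketch below is a minimal Python version, assuming the $B$ blocks are consecutive and of equal length $m/B$; the function name `blockwise_median_hamming` is illustrative.

```python
import numpy as np

def blockwise_median_hamming(b, c, B):
    """Median over B consecutive blocks of the normalized Hamming distance.

    b and c are 1-D binary codes of equal length m, with m divisible by B.
    """
    b = np.asarray(b)
    c = np.asarray(c)
    m = b.shape[0]
    if b.shape != c.shape or m % B != 0:
        raise ValueError("codes must have equal length divisible by B")
    # Row i holds the disagreement indicators of block i; its mean is that
    # block's normalized Hamming distance.
    per_block = (b != c).reshape(B, m // B).mean(axis=1)
    return float(np.median(per_block))

b = [0, 0, 0, 0, 0, 0]
c = [1, 1, 0, 0, 1, 0]
print(blockwise_median_hamming(b, c, B=3))  # block distances are 1.0, 0.0, 0.5
```

With $B = 1$ the function reduces to the ordinary normalized Hamming distance.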

It is worth noting that our first step is one construction of a fast JL transform. In fact any fast JL
transform would work for our construction, but we choose a standard one with real values: based on
Rudelson and Vershynin (2008); Cheraghchi et al. (2013); Krahmer and Ward (2011), it is known
that with $m = O(\epsilon^{-2}\log N \log p \log^3(\log N))$ measurements, a subsampled Hadamard matrix with
column flips becomes an $\epsilon$-JL matrix for $N$ points.

The second part of our algorithm follows framework (2.1). By choosing a Gaussian random
vector in each row of $T$, from our previous discussion in Section 2.2, the probability that such a
hyperplane intersects the arc between two points is equal to their geodesic distance. Compared to
a fully random Gaussian matrix, as used in Algorithm 1, the key difference is that the hyperplanes
represented by rows of $T$ are not independent of each other; this poses the main analytical
challenge.

Algorithm 2 Fast Binary Embedding

input Finite number of points $\{x_i\}_{i=1}^{N}$ where each point $x_i \in \mathbb{S}^{p-1}$, embedded dimension $m$,
intermediate dimension $n$, number of blocks $B$.
1: Draw an $n$-by-$p$ sub-sampled Walsh-Hadamard matrix $\Phi$ according to (3.3). Draw $B$ independent
   partial Gaussian Toeplitz matrices $\{\Psi^{(j)}\}_{j=1}^{B}$ with size $(m/B)$-by-$n$ according to (3.4).
2: {Part I: Fast JL}
3: for $i = 1, 2, \ldots, N$ do
4:   $y_i \leftarrow \Phi x_i$.
5: end for
6: {Part II: Partial Gaussian Toeplitz Projection}
7: for $i = 1, 2, \ldots, N$ do
8:   for $j = 1, 2, \ldots, B$ do
9:     $c_j \leftarrow \mathrm{sign}(\Psi^{(j)} y_i)$.
10:   end for
11:   $b_i \leftarrow [c_1; c_2; \ldots; c_B]$
12: end for
output $\{b_i\}_{i=1}^{N}$
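
The two parts of Algorithm 2 can be sketched end to end as follows. This is a hedged illustration, not the authors' implementation: it assumes $p$ is a power of two, samples the $n$ Hadamard rows with replacement, evaluates each Toeplitz product with a single FFT of length $3n-2$, and stores bits as booleans. The names `fwht` and `fast_binary_embedding` and all sizes are assumptions made for the example.

```python
import numpy as np

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform of a 1-D array whose
    length is a power of two (Hadamard ordering), in O(p log p) time."""
    v = np.array(v, dtype=float)
    h = 1
    while h < v.size:
        v = v.reshape(-1, 2, h)
        # Butterfly step: (a, b) -> (a + b, a - b) on halves of each block.
        v = np.stack((v[:, 0] + v[:, 1], v[:, 0] - v[:, 1]), axis=1).reshape(-1)
        h *= 2
    return v

def fast_binary_embedding(X, m, n, B, rng):
    """Sketch of Algorithm 2. X has shape (N, p); returns (N, m) boolean codes."""
    N, p = X.shape
    k = m // B                                   # bits per block
    assert p & (p - 1) == 0, "p must be a power of two for the FWHT"
    assert m % B == 0 and k <= n, "need m divisible by B and m/B <= n"
    # Part I: y = Phi x with Phi = P H D (column flips, Hadamard, row sampling).
    flips = rng.choice([-1.0, 1.0], size=p)
    rows = rng.integers(0, p, size=n)            # sampled with replacement
    # Dividing by sqrt(n) keeps norms roughly 1; signs do not depend on it.
    Y = np.stack([fwht(flips * x)[rows] for x in X]) / np.sqrt(n)
    # Part II: B independent partial Gaussian Toeplitz projections, then signs.
    L = 3 * n - 2
    blocks = []
    for _ in range(B):
        g = rng.standard_normal(2 * n - 1)
        zeta = rng.choice([-1.0, 1.0], size=n)
        conv = np.fft.irfft(
            np.fft.rfft(g, L)[None, :] * np.fft.rfft(Y * zeta, L, axis=1), L, axis=1
        )
        blocks.append(conv[:, n - 1 : n - 1 + k] >= 0.0)  # first k rows of T D y
    return np.concatenate(blocks, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))
X /= np.linalg.norm(X, axis=1, keepdims=True)
codes = fast_binary_embedding(X, m=896, n=128, B=7, rng=rng)
print(codes.shape)
```

Distances between the resulting codes are meant to be compared with the block-wise median of (3.5), using the same number of blocks $B$ that was used to generate them.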


3.2.2 Analysis 

We give the analysis for Algorithm 2. We first review a well known result about fast JL transform.

Lemma 3.5. Consider the column flipped partial Hadamard matrix $\Phi$ defined in (3.3) with size $m$-by-$p$.
For $N$ points $x_1, x_2, \ldots, x_N \in \mathbb{S}^{p-1}$, let $y_i = \Phi x_i$, $\forall\, i \in [N]$. For some absolute
constant $c$, suppose $m \ge c\,\delta^{-2}\log N \log p \log^3(\log N)$, then with probability at least 0.99, we have
that for any $i, j \in [N]$

$$\big|\, \|y_i - y_j\|_2 - \|x_i - x_j\|_2 \,\big| \le \delta \|x_i - x_j\|_2, \tag{3.6}$$

and for any $i \in [N]$

$$\big|\, \|y_i\|_2 - 1 \,\big| \le \delta. \tag{3.7}$$

Proof. It can be proved by combining Theorem 14 in Cheraghchi et al. (2013) and Theorem 3.1 in
Krahmer and Ward (2011). □

The above result suggests that the first part of our algorithm reduces the dimension while
preserving well the Euclidean distance of each pair. Under this condition, all the pairwise geodesic
distances are also well preserved as confirmed by the following result.

Lemma 3.6. Consider the set of embedded points $\{y_i\}_{i=1}^{N}$ defined in Lemma 3.5. Suppose conditions
(3.6)-(3.7) hold with $\delta > 0$. Then for any $i, j \in [N]$,

$$|d(y_i, y_j) - d(x_i, x_j)| \le C\delta \tag{3.8}$$

holds with some absolute constant $C$.

Proof. We postpone the proof to Appendix A. □

The next result is our independence lemma, and is one of the key technical ideas that make our
result possible. The result shows that for any fixed $x$, Gaussian Toeplitz projection (with column
flips) plus $\mathrm{sign}(\cdot)$ generates pair-wise independent binary codes.

Lemma 3.7. Let $g \sim \mathcal{N}(0, I_{2n-1})$, and let $\zeta = \{\zeta_i\}_{i=1}^{n}$ be an i.i.d. Rademacher sequence. Let $T$ be a
random Toeplitz matrix constructed from $g$ such that $T_{i,j} = g_{i-j+n}$. Consider any two distinct
rows of $T$, say $\xi, \xi'$. For any two fixed vectors $x, y \in \mathbb{R}^n$, we define the following random variables

$$X = \mathrm{sign}\langle \xi \odot \zeta, x\rangle, \quad X' = \mathrm{sign}\langle \xi' \odot \zeta, x\rangle;$$
$$Y = \mathrm{sign}\langle \xi \odot \zeta, y\rangle, \quad Y' = \mathrm{sign}\langle \xi' \odot \zeta, y\rangle.$$

We have

$$X \perp X', \quad X \perp Y', \quad Y \perp X', \quad Y \perp Y'.$$

Proof. See Section 5.3.1 for detailed proof. □
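
Lemma 3.7 can be sanity-checked numerically. Since each of $X, X', Y, Y'$ is a symmetric $\pm 1$ variable, independence of a pair is equivalent to a zero mean for their product, so a Monte Carlo estimate of those products should be near zero. The Python sketch below does this for two rows of $T$; it is an illustration rather than a proof, and the function name, row indices and trial count are assumptions. For contrast it also reports $\mathbb{E}[XY]$ for the same row, which equals $1 - 2d(x,y)$ by (2.4) and is generally nonzero.

```python
import numpy as np

def sign_correlations(x, y, i, k, trials, rng):
    """Monte Carlo estimates of E[XX'], E[XY'], E[YX'], E[YY'] and E[XY],
    where unprimed variables use row i of T D and primed ones use row k."""
    n = x.shape[0]
    g = rng.standard_normal((trials, 2 * n - 1))        # one g per trial
    zeta = rng.choice([-1.0, 1.0], size=(trials, n))    # one flip vector per trial
    cols = np.arange(n)
    xi = g[:, i - cols + n - 1] * zeta     # row i of T D (0-based indices)
    xi_p = g[:, k - cols + n - 1] * zeta   # row k of T D
    X, Y = np.sign(xi @ x), np.sign(xi @ y)
    Xp, Yp = np.sign(xi_p @ x), np.sign(xi_p @ y)
    return {
        "XX'": float(np.mean(X * Xp)),
        "XY'": float(np.mean(X * Yp)),
        "YX'": float(np.mean(Y * Xp)),
        "YY'": float(np.mean(Y * Yp)),
        "XY": float(np.mean(X * Y)),
    }

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
y = rng.standard_normal(6)
print(sign_correlations(x, y, i=0, k=2, trials=40000, rng=rng))
```

With 40000 trials the standard error of each estimate is about 0.005, so the four cross-row values should sit within a few hundredths of zero.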

We are ready to prove the following result about Algorithm 2. 

Theorem 3.8. Consider Algorithm 2 with random matrices $\Phi$, $\Psi$ defined in (3.3) and (3.4) respectively.
For a finite number of points $\{x_i\}_{i=1}^{N}$, let $b_i$ be the binary code of $x_i$ generated by Algorithm
2. Suppose we set

$$B \ge c\log N, \quad n \ge c'(1/\delta^2)\log N \log p \log^3(\log N), \quad n \ge m/B \ge c''(1/\delta^2),$$

with some absolute constants $c, c', c''$, then with probability at least 0.98, we have that for any
$i, j \in [N]$

$$|d_{\mathcal{H}}(b_i, b_j; B) - d(x_i, x_j)| \le \delta.$$

Similarity metric $d_{\mathcal{H}}(\cdot, \cdot; B)$ is the median of normalized Hamming distance defined in (3.5).

Proof. See Section 5.3.2 for detailed proof. □




The above result suggests that the measurement complexity of our fast algorithm is $O(\frac{1}{\delta^2}\log N)$,
which matches the performance of Algorithm 1 based on a fully random matrix. Note that this
measurement complexity cannot be improved significantly by any data-oblivious binary embedding
with any similarity metric, as suggested by Theorem 3.1.

Running time: The first part of our algorithm takes time $O(p \log p)$. Generating a single block
of binary codes from a partial Toeplitz matrix takes time $O(n\log(1/\delta))$ (see footnote 1). Thus the total running time
is $O(Bn\log\frac{1}{\delta} + p\log p) = O(\frac{1}{\delta^2}\log\frac{1}{\delta}\log^2 N \log p \log^3(\log N) + p\log p)$. By ignoring the polynomial
$\log\log$ factor, the second term $O(p \log p)$ dominates when $\log N \lesssim \delta\sqrt{p/\log\frac{1}{\delta}}$.

Comparison to an alternative algorithm: Instead of utilizing the partial Gaussian Toeplitz
projection, an alternative method, to the best of our knowledge not previously stated, is to use
a fully random Gaussian projection in the second part of our algorithm. We present the details in
Algorithm 3. By combining Proposition 2.2 and Lemma 3.5, it is straightforward to show this
algorithm still achieves the same measurement complexity $O(\frac{1}{\delta^2}\log N)$. The corresponding running
time is $O(\frac{1}{\delta^4}\log^2 N \log p \log^3(\log N) + p\log p)$, so it is fast when $\log N \lesssim \delta^2\sqrt{p}$. Therefore our
algorithm has an improved dependence on $\delta$. This improvement comes from fast multiplication of
the partial Toeplitz matrix and the pair-wise independence argument shown in Lemma 3.7.


Algorithm 3 Alternative Fast Binary Embedding

input Finite number of points $\{x_i\}_{i=1}^{N}$ where each point $x_i \in \mathbb{S}^{p-1}$, embedded dimension $m$,
intermediate dimension $n$.
1: Draw an $n$-by-$p$ sub-sampled Walsh-Hadamard matrix $\Phi$ according to (3.3). Construct an $m$-by-$n$
   matrix $A$ where each entry is drawn independently from $\mathcal{N}(0,1)$.
2: for $i = 1, 2, \ldots, N$ do
3:   $b_i \leftarrow \mathrm{sign}(A \Phi x_i)$
4: end for
output $\{b_i\}_{i=1}^{N}$


3.3 $\delta$-uniform Embedding for General K

In this section, we turn back to the fully random projection binary embedding (Algorithm 1). Recall
that in Proposition 2.2, we show that for finite size $K$, $m = O(\frac{1}{\delta^2}\log|K|)$ measurements are sufficient
to achieve $\delta$-uniform embedding. For general $K$, the challenge is that there might be an infinite
number of distinct points in $K$, so Proposition 2.2 cannot be applied. In proving the JL lemma for an
infinite set $K$, the standard technique is either constructing an $\epsilon$-net of $K$ or reducing the distortion
to the deviation bound of a Gaussian process. However, due to the non-linearity essential for binary
embedding, these techniques cannot be directly extended to our setting. Therefore strengthening
Proposition 2.2 to infinite size $K$ poses significant technical challenges. Before stating our result,
we first give some definitions.

Footnote 1: Matrix-vector multiplication for an $m$-by-$n$ partial Toeplitz matrix can be implemented in running time $O(n \log m)$.

Definition 3.9. (Gaussian mean width) Let $g \sim \mathcal{N}(0, I_p)$. For any set $K \subseteq \mathbb{S}^{p-1}$, the Gaussian
mean width of $K$ is defined as

$$w(K) := \mathbb{E}_g \sup_{x \in K} |\langle g, x\rangle|.$$

Here, $w(K)^2$ measures the effective dimension of set $K$. In the trivial case $K = \mathbb{S}^{p-1}$, we have
$w(K)^2 \le p$. However, when $K$ has some special structure, we may have $w(K)^2 \ll p$. For instance,
when $K = \{x \in \mathbb{S}^{p-1} : |\mathrm{supp}(x)| \le s\}$, it has been shown that $w(K) = \Theta(\sqrt{s\log(p/s)})$ (see Lemma
2.3 in Plan and Vershynin (2013)).
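
For the sparse set above, the supremum in the definition of $w(K)$ has a closed form: over $s$-sparse unit vectors, $\sup_x |\langle g, x\rangle|$ is the Euclidean norm of the $s$ largest-magnitude entries of $g$. This makes a Monte Carlo estimate straightforward. The Python sketch below is illustrative only; the function name, sample sizes and seed are assumptions.

```python
import numpy as np

def mean_width_sparse(p, s, trials, rng):
    """Monte Carlo estimate of w(K) for K = {x on the unit sphere in R^p
    with at most s nonzero entries}.

    For each Gaussian draw g, the supremum of |<g, x>| over K is attained by
    aligning x with the s largest-magnitude coordinates of g, so it equals
    the l2 norm of those coordinates.
    """
    g = np.abs(rng.standard_normal((trials, p)))
    top = np.sort(g, axis=1)[:, p - s:]          # s largest magnitudes per draw
    return float(np.linalg.norm(top, axis=1).mean())

rng = np.random.default_rng(0)
p, s = 1000, 10
print(mean_width_sparse(p, s, 200, rng), np.sqrt(s * np.log(p / s)))
print(mean_width_sparse(p, p, 200, rng), np.sqrt(p))   # s = p is the full sphere
```

The first line compares the estimate with $\sqrt{s\log(p/s)}$, which it should match up to a constant factor; the second shows that the full sphere has mean width close to $\sqrt{p}$.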

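For a given $\delta$, we define the expanded version of $K \subseteq \mathbb{S}^{p-1}$ as:

$$K_\delta^+ := K \cup \Big\{ z \in \mathbb{S}^{p-1} : z = \frac{x - y}{\|x - y\|_2},\ \forall\, x, y \in K \text{ such that } \delta^2 \le \|x - y\|_2 \le \delta \Big\}. \tag{3.9}$$

In other words, $K_\delta^+$ is constructed from $K$ by adding the normalized differences between pairs of
points in $K$ that are within $\delta$ but not closer than $\delta^2$. Now we state the main result as follows.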

Theorem 3.10. Consider any $K \subseteq \mathbb{S}^{p-1}$. Let $A \in \mathbb{R}^{m \times p}$ be an i.i.d. Gaussian matrix where each
row $A_i \sim \mathcal{N}(0, I_p)$. For any two points $x, y \in K$, $d_A(x,y)$ is defined in (2.2). Expanded set $K_\delta^+$ is
defined in (3.9). When

$$m \ge c\,\frac{w(K_\delta^+)^2}{\delta^4}$$

with some absolute constant $c$, then we have that

$$\sup_{x, y \in K} |d_A(x,y) - d(x,y)| \le \delta$$

holds with probability at least $1 - c_1\exp(-c_2\delta^2 m)$ where $c_1, c_2$ are absolute constants.

Proof. See Section 5.4 for detailed proof. □


Remark 3.11. We compare the above result to Theorem 1.5 from the recent paper Plan and Vershynin
(2014) where it is proved that for $m \gtrsim w(K)^2/\delta^6$, Algorithm 1 is guaranteed to achieve $\delta$-uniform
embedding for general $K$. Based on definition (3.9), we have

$$w(K) \le w(K_\delta^+) \le \frac{1}{\delta^2} w(K - K) \lesssim \frac{1}{\delta^2} w(K).$$

Thus in the worst case, Theorem 3.10 recovers the previous result up to a factor $1/\delta^2$. More importantly,
for many interesting sets one can show $w(K_\delta^+) \lesssim w(K)$; in such cases, our result leads to an
improved dependence on $\delta$. We give several such examples as follows:

• Low rank set. For some $U \in \mathbb{R}^{p \times r}$ such that $U^\top U = I_r$, let $K = \{x \in \mathbb{S}^{p-1} : x = Uc,\ \forall\, c \in \mathbb{S}^{r-1}\}$.
We simply have $K_\delta^+ = K$ and $w(K) \le \sqrt{r}$. Our result implies $m = O(r/\delta^4)$.

• Sparse set. $K = \{x \in \mathbb{S}^{p-1} : |\mathrm{supp}(x)| \le s\}$. In this case we have $K_\delta^+ \subseteq \{x \in \mathbb{S}^{p-1} :
|\mathrm{supp}(x)| \le 2s\}$. Therefore $w(K_\delta^+) = \Theta(\sqrt{s\log(p/s)})$. Our result implies $m = O(s\log(p/s)/\delta^4)$.

• Set with finite size. $|K| < \infty$. As $w(K) \lesssim \sqrt{\log|K|}$ and $|K_\delta^+| \le |K|^2$, our result implies
$m = O(\log|K|/\delta^4)$. We thus recover Proposition 2.2 up to a factor $1/\delta^2$.

Applying the result from Plan and Vershynin (2014) to the above sets implies similar results
but the dependence on $\delta$ becomes $1/\delta^6$.
4 Numerical Results


In this section, we present the results of experiments we conduct to validate our theory and compare
the performance of the following three algorithms we discussed: uniform random projection (URP)
(Algorithm 1), fast binary embedding (FBE) (Algorithm 2) and the alternative fast binary embedding
(FBE-2) (Algorithm 3). We first apply these algorithms to synthetic datasets. In detail, given
parameters $(N, p)$, a synthetic dataset is constructed by sampling $N$ points from $\mathbb{S}^{p-1}$ uniformly at
random. Recall that $\delta$ is the maximum embedding distortion among all pairs of points. We use $m$
to denote the number of binary measurements. Algorithm FBE needs parameters $n, B$, which are
the intermediate dimension and the number of blocks respectively. Based on Theorem 3.8, $n$ is required
to be proportional to $m$ (up to some logarithmic factors) and $B$ is required to be proportional to
$\log N$. We thus set $n \approx 1.3m$, $B \approx 1.8 \log N$. We also set $n \approx 1.3m$ for FBE-2. In addition, we fix
$p = 512$. We report our first result showing the functional relationship between $(m, N, \delta)$ in Figure
1. In particular, panel 1(a) shows the change of distortion $\delta$ over the number of measurements
$m$ for fixed $N$. We observe that, for all three algorithms, $\delta$ decays with $m$ at the rate predicted
by Proposition 2.2 and Theorem 3.8. Panel 1(b) shows the empirical relationship between $m$ and
$\log N$ for fixed $\delta$. As predicted by our theory (lower bound and upper bound), $m$ has a linear
dependence on $\log N$.
Figure 1: Results on synthetic datasets. (a) Each point, along with the standard deviation represented
by the error bar, is an average of 50 trials, each of which is based on a fresh synthetic dataset
of size $N = 300$ and a newly constructed embedding mapping. (b) Each point is computed by
slicing at $\delta = 0.3$ in plots similar to (a) but with the corresponding $N$.


A popular application of binary embedding is image retrieval, as considered in (Gong and Lazebnik,
2011; Gong et al., 2013; Yu et al., 2014). We thus conduct an experiment on the Flickr-25600 dataset
that consists of 10k images from the Internet. Each image is represented by a 25600-dimensional
normalized Fisher vector. We take 500 randomly sampled images as query points and leave the rest
as the base for retrieval. The relevant images of each query are defined as its 10 nearest neighbors
[Figure 2 panels: (a) m = 5000, (b) m = 10000, (c) m = 15000; horizontal axis: number of retrieved images.]

Figure 2: Image retrieval results on Flickr-25600. Each panel presents the recall for the specified number
of measurements $m$. Black and blue dotted lines are respectively the recall of FBE-2 and URP with
fewer measurements but the same running time as FBE.


based on geodesic distance. Given $m$, we apply FBE, FBE-2 and URP to convert all images into
$m$-dimensional binary codes. In particular, we set $B = 10$ for FBE and $n \approx 1.3m$ for FBE and
FBE-2. Then we leverage the corresponding similarity metrics, (3.5) for FBE and Hamming distance
for FBE-2 and URP, to retrieve the nearest images for each query. The performance of each
algorithm is characterized by recall, i.e., the number of retrieved relevant images divided by the total
number of relevant images. We report our second result in Figure 2. Each panel shows the average
recall of all queries for a specified $m$. We note that FBE-2, as a fast algorithm, performs as well as
URP with the same number of measurements. In order to show the running time advantage of our
fast algorithm FBE, we also present the performance of FBE-2 and URP with fewer measurements,
chosen so that they can be computed in the same time as FBE. As we observe, with a large number of
measurements, FBE-2 and URP perform marginally better than FBE, while FBE has a significant
improvement over the two algorithms under an identical time constraint.
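The recall metric described above can be sketched in a few lines. This is a minimal illustration, assuming plain Hamming distance as the similarity metric (as used for FBE-2 and URP; the metric (3.5) used for FBE is not reproduced here). The function names `hamming` and `recall_at_k` and the toy codes are hypothetical and not taken from the experiments.

```python
def hamming(a, b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def recall_at_k(query_code, base_codes, relevant, k):
    """Retrieved relevant items divided by the total number of relevant items."""
    ranked = sorted(range(len(base_codes)),
                    key=lambda i: hamming(query_code, base_codes[i]))
    retrieved = set(ranked[:k])
    return len(retrieved & set(relevant)) / len(relevant)


# Toy example: four base codes, two of which are marked relevant.
query = (0, 0, 0, 0)
base = [(0, 0, 0, 1), (1, 1, 1, 1), (0, 0, 0, 0), (1, 1, 0, 0)]
relevant = {0, 2}
r1 = recall_at_k(query, base, relevant, k=1)
r2 = recall_at_k(query, base, relevant, k=2)
print(r1, r2)
```

Averaging `recall_at_k` over all queries for each number of retrieved images gives a curve of the kind plotted in Figure 2.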


5 Proofs 

5.1 Proof of Data-Oblivious Lower Bound (Theorem 3.1) 

The proof of the data-oblivious lower bound is based on a lower bound for one-way communication 
of Hamming distance due to Jayram and Woodruff (2013). 

Definition 5.1 (One-way communication of Hamming distance). In the one-way communication
model, Alice is given $a \in \{0,1\}^n$ and Bob is given $b \in \{0,1\}^n$. Alice sends Bob a message
$c \in \{0,1\}^m$, and Bob uses $b$ and $c$ to output a value $x \in \mathbb{R}$. Alice and Bob have shared randomness.
Alice and Bob solve the $(\delta, \epsilon)$ additive Hamming distance estimation problem if $|x - d_H(a, b)| \le \delta$
with probability $1 - \epsilon$.

The result proven in Jayram and Woodruff (2013) is a lower bound for the multiplicative Hamming
distance estimation problem, but their techniques readily yield a bound for the additive case
as well:


Lemma 5.2. Any algorithm that solves the $(\delta, \epsilon)$ additive Hamming distance estimation problem
must have $m = \Omega((1/\delta^2) \log(1/\epsilon))$ as long as this is less than $n$.

Proof. We apply Lemma 3.1 of Jayram and Woodruff (2013) with parameters $\alpha = 2$, $p = 1$, $b = 1$,
$\epsilon = \delta$, and $\delta = \epsilon$. This encodes inputs from a problem they prove is hard (augmented indexing on
large domains) to inputs appropriate for Hamming estimation. In particular, for $n' = O(\frac{1}{\delta^2} \log(1/\epsilon))$
it gives a distribution on $(a, b) \in \{0,1\}^{n'} \times \{0,1\}^{n'}$ that are divided into "NO" and "YES" instances,
such that:

• From the reduction, distinguishing NO instances from YES instances with probability $1 - \epsilon$
requires Alice to send $m = \Omega(\frac{1}{\delta^2} \log(1/\epsilon))$ bits of communication to Bob.

• In NO instances, $d_H(a, b) \ge \frac{1}{2}(1 - \delta/3)$.

• In YES instances, $d_H(a, b) \le \frac{1}{2}(1 - 2\delta/3)$.

First, suppose $n = n'$. The gap between NO and YES instances is at least $\delta/6$, so solving the additive
Hamming distance estimation problem with $\delta/12$ accuracy would distinguish NO instances from YES
instances; it must therefore involve $m = \Omega(\frac{1}{\delta^2} \log(1/\epsilon))$ bits of communication.

For $n > n'$, simply duplicate the coordinates of $a$ and $b$ $\lfloor n/n' \rfloor$ times, and zero-pad the remainder.
Less than half the coordinates are then part of the zero-padding, so the gap between YES and NO
instances remains at least $\delta/12$ and a protocol for the $(\delta/24, \epsilon)$ additive Hamming distance estimation
problem requires $m = \Omega(\frac{1}{\delta^2} \log(1/\epsilon))$ as desired. □
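The duplicate-and-pad step can be checked on a small instance. The sketch below assumes $d_H$ denotes the Hamming distance normalized by the string length; the helper names `normalized_hamming` and `duplicate_and_pad` and the toy bit strings are illustrative and not part of the original reduction.

```python
def normalized_hamming(a, b):
    """Fraction of coordinates in which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def duplicate_and_pad(bits, n):
    """Repeat `bits` floor(n / len(bits)) times, then zero-pad to length n."""
    reps = n // len(bits)
    out = bits * reps
    return out + [0] * (n - len(out))


a = [0, 1, 1, 0, 1]
b = [1, 1, 0, 0, 0]
n = 13  # target length, larger than len(a) = 5

a_big, b_big = duplicate_and_pad(a, n), duplicate_and_pad(b, n)
d_small = normalized_hamming(a, b)
d_big = normalized_hamming(a_big, b_big)
print(d_small, d_big, d_big / d_small)
```

Every instance is rescaled by the same factor $\lfloor n/n' \rfloor n'/n > 1/2$, which is why a gap of $\delta/6$ between YES and NO instances shrinks to no less than $\delta/12$.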

With this in hand, we can prove Theorem 3.1: 

Proof of Theorem 3.1. We reduce one-way communication of the $(\delta, \epsilon)$ additive Hamming distance
estimation problem to the embedding problem. Let $a, b \in \{0,1\}^p$ be drawn from the hard instance
for the communication problem defined in Lemma 5.2. Linearly transform them to $u, v \in \mathbb{S}^{p-1}$ via
$u = (2a - \mathbf{1})/\sqrt{p}$, $v = (2b - \mathbf{1})/\sqrt{p}$. We have that $\langle u, v \rangle = 1 - 2 d_H(a, b)$, so

$$d(u, v) = 1 - \frac{\arccos(\langle u, v \rangle)}{\pi} = 1 - \frac{\arccos(1 - 2 d_H(a, b))}{\pi}$$

or

$$d_H(a, b) = \frac{1}{2}\big(1 - \cos(\pi - \pi d(u, v))\big).$$

Given an estimate of $d(u, v)$, we can therefore get an estimate of $d_H(a, b)$. In particular, since
$|\cos'(x)| \le 1$, if we learn $d(u, v)$ to $\pm\delta$ then we learn $d_H(a, b)$ to $\pm\delta\pi/2$.

For now, consider the case of $N = 2$. Consider an oblivious embedding function $f : \mathbb{S}^{p-1} \to
\{0,1\}^m$ and reconstruction algorithm $g : \{0,1\}^m \times \{0,1\}^m \to \mathbb{R}$ that has

$$|g(f(u), f(v)) - d(u, v)| \le \frac{2\delta}{\pi}$$

with probability $1 - \epsilon$ on the distribution of inputs $(u, v)$. We can solve the one-way communication
problem for Hamming distance estimation by Alice sending $f(u)$ to Bob, Bob learning $d(u, v) \approx
g(f(u), f(v))$, and then computing $d_H(a, b)$ to $\pm\delta$. By the lower bound for this problem, any such
$f$ and $g$ must have $m = \Omega(\frac{1}{\delta^2} \log \frac{1}{\epsilon})$, proving the result for $N = 2$ (after rescaling $\delta$).
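The identity $\langle u, v \rangle = 1 - 2 d_H(a, b)$ that drives this reduction is easy to verify numerically. The following is a small sketch, assuming $d_H$ is the Hamming distance normalized by $p$; the helper names `to_sphere` and `inner` and the two bit strings are illustrative.

```python
import math


def to_sphere(bits):
    """Map a 0/1 list of length p to the unit vector (2*bits - 1) / sqrt(p)."""
    p = len(bits)
    return [(2 * x - 1) / math.sqrt(p) for x in bits]


def inner(u, v):
    return sum(x * y for x, y in zip(u, v))


a = [0, 1, 1, 0, 1, 0, 0, 1]
b = [1, 1, 0, 0, 1, 0, 1, 1]
u, v = to_sphere(a), to_sphere(b)

# Normalized Hamming distance between the original bit strings.
d_h = sum(x != y for x, y in zip(a, b)) / len(a)

print(inner(u, u), inner(u, v), 1 - 2 * d_h)
```

Both vectors have unit norm, and the inner product agrees with $1 - 2 d_H(a, b)$ up to floating-point error, so any estimate of the angle between $u$ and $v$ translates into an estimate of $d_H(a, b)$.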

For general $N$, we draw instances $(u_1, v_1), (u_2, v_2), \ldots, (u_{N/2}, v_{N/2})$ independently from the
hard instance for binary embedding of $N = 2$ and $\epsilon' = 4\epsilon/N$. Consider an oblivious embedding
function $f : \mathbb{S}^{p-1} \to \{0,1\}^m$ and reconstruction algorithm $g : \{0,1\}^m \times \{0,1\}^m \to \mathbb{R}$ that has, for
all $i \in [N/2]$,

$$|g(f(u_i), f(v_i)) - d(u_i, v_i)| \le \delta$$

with probability $1 - \epsilon$ on this distribution. Define $\alpha$ to be the probability that $|g(f(u_i), f(v_i)) - d(u_i, v_i)| \le
\delta$ for any particular $i$. Because $f$ and $g$ are oblivious and the different instances are independent,
the probability that all instances succeed is $\alpha^{N/2} \ge 1 - \epsilon$, so

$$\alpha \ge (1 - \epsilon)^{2/N} \ge 1 - 4\epsilon/N.$$

In particular, this means $f$ and $g$ solve the hard instance of binary embedding with $N = 2$, $\epsilon' = 4\epsilon/N$.

By the above lower bound for $N = 2$, this means

$$m = \Omega\Big(\frac{1}{\delta^2} \log(N/\epsilon)\Big)$$

as desired. □

5.2 Proof of Data-Dependent Lower Bound (Theorem 3.3) 

We need a few ingredients to show the lower bound. First, we define a matrix that is close to the
identity matrix.

Definition 5.3. ($(\delta_1, \delta_2)$-near identity matrix) A symmetric matrix $M \in \mathbb{R}^{p \times p}$ is called a $(\delta_1, \delta_2)$-near
identity matrix if it satisfies both of the following conditions:

$$1 - \delta_1 \le M_{ii} \le 1, \quad \forall\, i \in [p],$$

$$|M_{ij}| \le \delta_2, \quad \forall\, i \ne j \in [p].$$

Next we give a lower bound on the rank of a $(\delta_1, \delta_2)$-near identity matrix.

Lemma 5.4. Suppose a positive semidefinite matrix $M \in \mathbb{R}^{p \times p}$ is a $(\delta_1, \delta_2)$-near identity matrix with
rank $d$, and $0 \le \delta_1, \delta_2 < 1$. Then we have

$$d \ge \frac{p(1 - \delta_1)^2}{1 + (p - 1)\delta_2^2}.$$


Proof. We postpone the proof to Appendix B. 


□ 
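The rank bound of Lemma 5.4 can be sanity-checked on a small Gram matrix. The sketch below is an illustration under stated assumptions: the test matrix (three unit vectors in the plane at 120 degree spacing), the helper names `gram`, `rank` and `rank_lower_bound`, and the numerical tolerance are all choices made for this example and do not come from the paper.

```python
import math


def gram(vectors):
    """Gram matrix of a list of vectors (positive semidefinite by construction)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        if r == len(m):
            break
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def rank_lower_bound(matrix):
    """Evaluate p (1 - d1)^2 / (1 + (p - 1) d2^2) with the tightest d1, d2."""
    p = len(matrix)
    d1 = max(0.0, 1.0 - min(matrix[i][i] for i in range(p)))
    d2 = max(abs(matrix[i][j]) for i in range(p) for j in range(p) if i != j)
    return p * (1 - d1) ** 2 / (1 + (p - 1) * d2 ** 2)


angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
M = gram([[math.cos(t), math.sin(t)] for t in angles])
print(rank(M), rank_lower_bound(M))
```

For this matrix the off-diagonal entries are $-1/2$, so the bound evaluates to $3/(1 + 2 \cdot 1/4) = 2$, which matches the true rank; the bound is tight in this instance.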


The above result is weak when it is applied to show our desired lower bound. We still need to 
make use of the following combinatorial result. 
Lemma 5.5. Suppose matrix $M \in \mathbb{R}^{p \times p}$ has rank $d$. Let $P(x)$ be any degree $k$ polynomial function.
Consider the matrix $N \in \mathbb{R}^{p \times p}$ defined as $N := P(M)$, where $N_{ij} = P(M_{ij})$. We have

$$\mathrm{rank}(N) \le \binom{k + d}{k}.$$

Proof. See Lemma 9.2 of Alon (2003) for a detailed proof. □

Now we are ready to prove Theorem 3.3.

Proof of Theorem 3.3. Let $e_i$ denote the $i$'th natural basis of $\mathbb{R}^p$, i.e., the $i$'th coordinate is 1 while
the rest are all zeros. Consider $N$ points $\{e_1, e_2, \ldots, e_N\}$ and their opposite vectors $\{-e_1, -e_2, \ldots, -e_N\}$.
For any binary embedding algorithm $f$, we let

$$b_i := f(e_i), \quad \forall\, i \in [N],$$

$$c_i := f(-e_i), \quad \forall\, i \in [N].$$

Under the condition that $f$ solves the general binary embedding problem with link function $\mathcal{L}$,
we have

$$|d_H(b_i, c_i) - \mathcal{L}(d(e_i, -e_i))| \le \delta, \quad \forall\, i \in [N]. \tag{5.1}$$

As $d(e_i, -e_i) = 1$, we have

$$\mathcal{L}(1) + \delta \ge d_H(b_i, c_i) \ge \mathcal{L}(1) - \delta. \tag{5.2}$$

Similarly, note that

$$d(e_i, e_j) = d(e_i, -e_j) = d(-e_i, -e_j) = \frac{1}{2}, \quad \forall\, i \ne j;$$

we have, $\forall\, i \ne j$,

$$\mathcal{L}(1/2) - \delta \le d_H(b_i, b_j) \le \mathcal{L}(1/2) + \delta, \tag{5.3}$$

$$\mathcal{L}(1/2) - \delta \le d_H(c_i, c_j) \le \mathcal{L}(1/2) + \delta, \tag{5.4}$$

$$\mathcal{L}(1/2) - \delta \le d_H(b_i, c_j) \le \mathcal{L}(1/2) + \delta. \tag{5.5}$$

From now on, we treat binary strings $b_i, c_i$ as vectors in $\{-1, 1\}^m$. Let $B$ denote the matrix with rows
$b_i$ and $C$ denote the matrix with rows $c_i$. Consider the outer product of the difference between $B$
and $C$, namely

$$M = (B - C)(B - C)^\top.$$

Note that $\forall\, i \in [N]$,

$$M_{ii} = \|b_i - c_i\|_2^2 = 4m \cdot d_H(b_i, c_i) \ge 4m(\mathcal{L}(1) - \delta).$$

The last inequality follows from (5.2). For $\forall\, i \ne j$, we have

$$M_{ij} = \langle b_i - c_i, b_j - c_j \rangle = \langle b_i, b_j \rangle + \langle c_i, c_j \rangle - \langle b_i, c_j \rangle - \langle b_j, c_i \rangle$$

$$= 2m\big( d_H(b_i, c_j) + d_H(b_j, c_i) - d_H(b_i, b_j) - d_H(c_i, c_j) \big),$$

where the third equality follows from

$$d_H(b, c) = \frac{1}{4m}\big(\|b\|_2^2 + \|c\|_2^2 - 2\langle b, c \rangle\big), \quad \forall\, b, c \in \{-1, 1\}^m.$$

By using (5.3) to (5.5), we have

$$|M_{ij}| \le 8\delta m.$$

Therefore, 4 m.(/:}i)+< 5 ) ^ actually a ^ i^f) j'^^^ar identity matrix. Consider degree k polyno¬ 
mial P{z) = . Let 


N = P( 


1 


-M). 


'4m • £(1)' 

It is easy to observe that N is a ( 7 i, 72 )-near identity matrix where 

25 


71 = 1 - (1 - 


C{1)- 


and 


/ 25 \ k 
^ ^£(1)^ ■ 


Under the condition < 7 , we have 


71 = 1- (1- ^)fc < i_ 


By setting k = I , we have 


72 <a/^. 


We apply Lemma 5.4 by setting $\delta_1, \delta_2, p$ in the statement to be $\gamma_1, \gamma_2, N$ respectively. We get

'life 

/VI 

rank(N) > 


> -(-)^iV > (-)^iV. 


l-h(iV-l)/lV “ 

On the other hand, 4 yy^.£(i) M has rank at most m. By applying Lemma 5.5 we get 


(5.6) 


rank(N) < 


m + k\ ,e{m + k)..k 


k 




k 


)*. 


Applying the above result and (5.6) directly yields that 

(JV)I/*^ < Se^. 

k 

When k = \ as we set, . Therefore we have 

2 log ^ ^ 25 ^ 


m 


1 /^(1)n2 


1 ,£(l)m, 1 .£(l )^2 logiV 


— 19c V X 1 — P.aY 9A ) 19«cV A ) 


2>2e^ 5 ' - 64e ^ 25 ^ 128e ^ <5 Mog 

where the second inequality holds when (^^^)^ > 64e. 


-^(1) 

25 


□ 
5.3 Proofs about Fast Binary Embedding Algorithm
5.3.1 Proof of Lemma 3.7

Proof. It suffices to prove $X \perp Y'$. One can check similarly that the proof holds for the remaining
three results. Note that $X, Y'$ are binary random variables with values in $\{-1, 1\}$. It is easy to
observe that both of them are balanced, namely $\Pr(X = 1) = \Pr(Y' = 1) = 1/2$. If $X \perp Y'$, then we have
$\Pr(X = Y') = 1/2$. In the reverse direction, suppose $\Pr(X = Y') = 1/2$. First we have

$$\Pr(X = 1) = \Pr(X = 1, Y' = 1) + \Pr(X = 1, Y' = -1) = 1/2, \tag{5.7}$$

$$\Pr(Y' = 1) = \Pr(X = 1, Y' = 1) + \Pr(X = -1, Y' = 1) = 1/2. \tag{5.8}$$

Combining the above two results, we have $\Pr(X = 1, Y' = -1) = \Pr(X = -1, Y' = 1)$. Using
$\Pr(X = 1, Y' = -1) + \Pr(X = -1, Y' = 1) = \Pr(X \ne Y') = 1 - \Pr(X = Y') = \frac{1}{2}$, we thus have
$\Pr(X = 1, Y' = -1) = \Pr(X = -1, Y' = 1) = 1/4$. Plugging the above result into (5.7) and (5.8),
we have $\Pr(X = 1, Y' = 1) = \Pr(X = -1, Y' = -1) = 1/4$. Thus we have shown

$$\Pr(X = v \mid Y' = u) = \frac{\Pr(X = v, Y' = u)}{\Pr(Y' = u)} = \frac{1}{2} = \Pr(X = v), \quad \forall\, u, v \in \{-1, 1\},$$

which leads to $X \perp Y'$.

Using the above arguments, we show that $X \perp Y'$ if and only if

$$\Pr(X = Y') = 1/2.$$
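The equivalence just established can be verified exhaustively with exact arithmetic. The sketch below relies on the observation from (5.7) and (5.8) that every joint distribution of two balanced $\pm 1$ variables is determined by the single number $q = \Pr(X = Y')$; the function names `joint_from_agreement` and `is_independent` are illustrative.

```python
from fractions import Fraction


def joint_from_agreement(q):
    """Joint pmf of two balanced variables in {-1, 1} with Pr(X = Y) = q.

    Balanced marginals force Pr(1, 1) = Pr(-1, -1) and Pr(1, -1) = Pr(-1, 1).
    """
    same, diff = q / 2, (1 - q) / 2
    return {(1, 1): same, (-1, -1): same, (1, -1): diff, (-1, 1): diff}


def is_independent(joint):
    """Check Pr(X = x, Y = y) = Pr(X = x) Pr(Y = y) for all four outcomes."""
    px = {x: sum(pr for (a, _), pr in joint.items() if a == x) for x in (-1, 1)}
    py = {y: sum(pr for (_, b), pr in joint.items() if b == y) for y in (-1, 1)}
    return all(joint[(x, y)] == px[x] * py[y] for x in (-1, 1) for y in (-1, 1))


grid = [Fraction(k, 8) for k in range(9)]
independent_at = [q for q in grid if is_independent(joint_from_agreement(q))]
print(independent_at)
```

On this grid the only agreement probability that yields independence is $1/2$, in line with the argument above.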

Recalling the definition of X, Y' , the above condition holds if and only if 

Pr| '(^0C,a;) -^OC.y) > o| = i. 

z 

Next we prove Z has symmetric distribution around 0. Let X = [l,n],X' = [l,n — A],Xo = 
[2n — A, 2n — 1] for some natural number A < n. Without loss of generality, we assume ^ = Qi and 
= [gxo'iQx']- We split X into T = [^] consecutive disjoint subsets Xi,X 2 ) • • • ^Xx each of which 
has size A except \Xx\ = n — [T — 1)A < A. Also, let X'rp_.^ contain the first n — [T — 1)A entries 
of Xt-i- Then we have 

T-2 s 

'^{gxi® Cii+i, yx,+i )+^ 0 Cit > I/It ) + ( 9 x 0 0 Cii, yii)) • (5-9) 

i=l ^ 

We now let g be such random vector that is identical to g except that for any i £ {0} U [T] 

gii = -gXi, if i mod 2 = 0 

Let C be such random vector that is identical to ^ except that for any i G {0} U [T] 

Cy = -Cii, if i mod 2 = 1. 


Z = 


{gXiQCxi,xXi 


2 = 1 
Replacing $g, \zeta$ in (5.9) with $\tilde{g}, \tilde{\zeta}$ yields
Z 

T-2 


{dxi QCxi,xxi)] ■ f {gii 0 Cxi+i,yx,+i) + ^ © Cxt^VXt) + {dxo 0 Cxi,yx 

i=l ' i=l 

T X .T-2 

“ X] 0 Cli, ®Ti) ) • ( yy (i/Ti 0 CXi+i, yXi+i ) + {9X’^_^ © Cit ^yXr) + (STo © Cli, ^Ti ) 


2=1 


2 = 1 


= - Z. 


As each entry of $g$ is a symmetric random variable around 0, $\tilde{g}$ and $g$ have the same probability
distribution. The same fact also holds for $\tilde{\zeta}$ and $\zeta$. So we conclude that $Z$ has a symmetric distribution
around 0, which implies $\Pr(Z \ge 0) = \frac{1}{2}$ and $X \perp Y'$. □


5.3.2 Proof of Theorem 3.8 

Proof. Unspecified notations in this section are consistent with Algorithm 2. Using Lemma 3.6, we
have

$$\Pr\Big\{ \sup_{i, j \in [N]} |d(y_i, y_j) - d(x_i, x_j)| \ge C\delta \Big\} \le 0.01. \tag{5.10}$$

Now consider the first-block binary codes generated from Gaussian Toeplitz projection. We focus 
on two intermediate points yi and y 2 ■ Consider the first block of binary codes generated from the 
second part of Algorithm 2. We let 

u = sign (T^^^ • yij , v = sign (T^^^ • ^ 2 ) • 

Suppose contains Gaussian Toeplitz matrix T. For any i £ [m/B], we have 

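$$u_i = \mathrm{sign}(\langle T_i \odot \zeta, y_1 \rangle) = \mathrm{sign}(\langle T_i, y_1 \odot \zeta \rangle),$$

$$v_i = \mathrm{sign}(\langle T_i \odot \zeta, y_2 \rangle) = \mathrm{sign}(\langle T_i, y_2 \odot \zeta \rangle).$$

Since $T_i$ is a Gaussian random vector, we have

$$\Pr(u_i \ne v_i) = d(y_1 \odot \zeta, y_2 \odot \zeta) = d(y_1, y_2).$$

Let $Z_i = \mathbb{1}(u_i \ne v_i)$, $\forall\, i \in [m/B]$. Following Lemma 3.7, we know that $\forall\, i \ne j$,

$$u_i \perp u_j, \quad u_i \perp v_j, \quad v_i \perp v_j, \quad v_i \perp u_j.$$

Therefore $\{Z_i\}_{i=1}^{m/B}$ is a pairwise independent sequence. By Markov's inequality, we have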



iXxXil < AJL < 1. 

(5^ 4 4 


(5.11) 
The last inequality holds by setting ^ Therefore, we have

Pr (^duiu^v) - d{yi,y 2 )\ > 5^ < ^. 

Now consider the $B$ blocks of binary codes in total from $y_1$ and $y_2$ respectively. Let

$$E_i = \mathbb{1}\big(|d_H(u_i, v_i) - d(y_1, y_2)| > \delta\big), \quad \forall\, i \in [B].$$

From (5.11), we have $\Pr(E_i = 1) \le \frac{1}{4}$. If more than half of the $E_i$ are 0, then the median of
$\{d_H(u_i, v_i)\}_{i=1}^{B}$ is within $\delta$ of $d(y_1, y_2)$. Then we have

Pr (^median {{dn{ui, Vi)}-d{yi,y 2 )\ > 6 
B B 

^ P'- (4 E s > 1) < Pr (1 f; - E(E.) > i) < exp(-iB). 

i=l i=l 

In the second inequality, we use (5.11). The last step follows from Hoeffding’s inequality. Now we 
use a union bound for pairs 

Pr ( sup \du{bi,hj)-d{yi,yj)\>^<N‘^exY>{-\B)<e^Y>{-\B). 

\i,je[N] y 4 « 

The last inequality holds by setting $B \ge 16 \log N$. Combining the above result and (5.10) using the
triangle inequality, we complete the proof. □
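The median step of this proof can be seen on a toy example. The sketch below is an illustration only: the block estimates, the true distance and the tolerance are made-up numbers, chosen so that two of five blocks fail badly.

```python
from statistics import mean, median

true_distance = 0.30
delta = 0.05

# Per-block Hamming estimates; the third and fifth blocks are far off,
# but more than half of the blocks are within delta of the truth.
block_estimates = [0.31, 0.28, 0.95, 0.33, 0.02]

good = [abs(e - true_distance) <= delta for e in block_estimates]
median_error = abs(median(block_estimates) - true_distance)
mean_error = abs(mean(block_estimates) - true_distance)
print(sum(good), median_error, mean_error)
```

Because three of the five blocks are accurate, the median lands within $\delta$ of the true distance even though the plain average does not. This is why it suffices for each block to succeed with probability only $3/4$, with the number of blocks $B$ driving the overall failure probability down.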

5.4 Proof of Theorem 3.10 

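For any set $K \subseteq \mathbb{S}^{p-1}$ we use $\mathcal{N}_\delta(K)$ to denote a constructed $\delta$-net of $K$, which is a $\delta$-covering set
with minimum size. In particular, by Sudakov's theorem (e.g., Theorem 3.18 in Ledoux and Talagrand
(1991))

$$\log |\mathcal{N}_\delta(K)| \lesssim \frac{w(K)^2}{\delta^2}.$$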

We first prove that for a fixed two dimensional space, m = 0{-^) independent Gaussian mea¬ 
surements are sufficient to achieve 5-uniform binary embedding. 

Lemma 5.6. Suppose K is any fixed two-dimensional subspace in 8^“^. Let A G be a matrix 

with independent rows A, ~ A4(0, Ip), Vi G [m] . Suppose m > ^log j, then with probability at 
least 1 — 3exp(—5^m), 

sup \dAix,y) - d{x,y)\ < C6. (5.12) 

x,y£K 

Here C is some absolute constant. 

Proof. We postpone the proof to Appendix C. □ 

The next lemma shows that the normalized $\ell_1$ norm of $Ax$ provides a decent approximation of
$\|x\|_2$.
Lemma 5.7. Consider any set $K \subseteq \mathbb{R}^p$. Let $A$ be an $m$-by-$p$ matrix with independent rows
$A_i \sim \mathcal{N}(0, I_p)$ for any $i \in [m]$. Consider


Z = sup 

xGK 


1 

m 


2=1 



We have 

where d{K) = maXa;gx ll^lb- 

Proof. See the proof of Lemma 2.1 in Plan and Vershynin (2014). □ 


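In order to connect the $\ell_1$ norm to Hamming distance, we need the following result.

Lemma 5.8. Consider a finite number of points $K \subseteq \mathbb{S}^{p-1}$. Let $A$ be an $m$-by-$p$ matrix with
independent rows $A_i \sim \mathcal{N}(0, I_p)$ for any $i \in [m]$. Suppose

$$m \ge \frac{1}{\delta^2} \log |K|,$$

then we have

$$\sup_{x \in K} \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\big\{ |\langle A_i, x \rangle| \le \delta \big\} \le 2\delta$$

with probability at least $1 - \exp(-\delta^2 m)$.

Proof. Let $X \sim \mathcal{N}(0, 1)$. For any fixed point $x \in K$ and any $i \in [m]$, we have

$$\Pr(|\langle A_i, x \rangle| \le \delta) = \Pr(|X| \le \delta) \le \delta.$$

Let $Z_i = \mathbb{1}(|\langle A_i, x \rangle| \le \delta)$, $\forall\, i \in [m]$. Then by using Hoeffding's inequality,

$$\Pr\Big( \frac{1}{m} \sum_{i=1}^{m} Z_i - \mathbb{E}(Z_i) > \delta \Big) \le \exp(-2\delta^2 m).$$

As $\mathbb{E}(Z_i) = \Pr(|\langle A_i, x \rangle| \le \delta) \le \delta$, we conclude that with probability at least $1 - \exp(-2\delta^2 m)$

$$\frac{1}{m} \sum_{i=1}^{m} Z_i \le 2\delta.$$

By applying a union bound over the $|K|$ points and setting $m \ge \frac{1}{\delta^2} \log |K|$, we complete the proof. □
Now we are ready to prove Theorem 3.10.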
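Two numerical facts used in this proof can be checked directly: the Gaussian small-ball bound $\Pr(|X| \le \delta) \le \delta$, and the union-bound arithmetic $|K| \exp(-2\delta^2 m) \le \exp(-\delta^2 m)$ once $m \ge \delta^{-2} \log |K|$. The sketch below uses only the standard error function; the helper names and the sample values of $\delta$ and $|K|$ are illustrative assumptions.

```python
import math


def prob_small_projection(delta):
    """Pr(|X| <= delta) for X ~ N(0, 1), computed via the error function."""
    return math.erf(delta / math.sqrt(2.0))


def min_measurements(delta, num_points):
    """Smallest integer m with m >= log(num_points) / delta**2."""
    return math.ceil(math.log(num_points) / delta ** 2)


deltas = [0.01, 0.05, 0.1, 0.3, 0.5, 1.0]
small_ball_ok = all(prob_small_projection(d) <= d for d in deltas)

delta, num_points = 0.1, 1000
m = min_measurements(delta, num_points)
union_bound = num_points * math.exp(-2 * delta ** 2 * m)
target = math.exp(-delta ** 2 * m)
print(small_ball_ok, m, union_bound, target)
```

The first check reflects that the standard Gaussian density is bounded by $1/\sqrt{2\pi} < 1/2$; the second shows how the factor $|K|$ from the union bound is absorbed by half of the exponent.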
Now we are ready to prove Theorem 3.10. 
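Proof of Theorem 3.10. We construct a $\delta$-net of $K$ that is denoted as $\mathcal{N}_\delta$. We assume $m \ge
\frac{1}{\delta^2} \log |\mathcal{N}_\delta|$. Applying Proposition 2.2 and setting $K = \mathcal{N}_\delta$, we have that

$$\sup_{x, y \in \mathcal{N}_\delta} |d_A(x, y) - d(x, y)| \le \delta \tag{5.13}$$

with probability at least $1 - 2\exp(-\delta^2 m)$.

For any two fixed points $x, y \in K$, let $x_1, y_1$ be their nearest points in $\mathcal{N}_\delta$. Then we have

$$\begin{aligned}
|d(x, y) - d_A(x, y)| &\le |d(x, y) - d(x_1, y_1)| + |d(x_1, y_1) - d_A(x, y)| \\
&\overset{(a)}{\le} |d(x_1, y_1) - d_A(x, y)| + 2\delta \le |d_A(x_1, y_1) - d_A(x, y)| + |d(x_1, y_1) - d_A(x_1, y_1)| + 2\delta \\
&\overset{(b)}{\le} |d_A(x_1, y_1) - d_A(x, y)| + 3\delta \le |d_A(x_1, y_1) - d_A(x_1, y)| + |d_A(x_1, y) - d_A(x, y)| + 3\delta \\
&\overset{(c)}{\le} d_A(y_1, y) + d_A(x_1, x) + 3\delta,
\end{aligned} \tag{5.14}$$

where (a) follows from

$$|d(x, y) - d(x_1, y_1)| \le |d(x, y) - d(x_1, y)| + |d(x_1, y) - d(x_1, y_1)| \le d(x, x_1) + d(y, y_1) \le 2\delta,$$

step (b) follows from (5.13), and step (c) follows from the triangle inequality of Hamming distance.
Therefore we have

$$\sup_{x, y \in K} |d_A(x, y) - d(x, y)| \le 2 \sup_{x_1 \in \mathcal{N}_\delta} \sup_{\substack{x \in K \\ \|x - x_1\|_2 \le \delta}} d_A(x, x_1) + 3\delta. \tag{5.15}$$

Next we bound the tail term

$$T := \sup_{x_1 \in \mathcal{N}_\delta} \sup_{\substack{x \in K \\ \|x - x_1\|_2 \le \delta}} d_A(x, x_1).$$

Recall that

$$K_\delta^+ := K \cup \left\{ z \in \mathbb{S}^{p-1} : z = \frac{x - y}{\|x - y\|_2},\ \forall\, x, y \in K \text{ if } \delta^2 \le \|x - y\|_2 \le \delta \right\}.$$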

Now we construct a 5-net for \ K denoted as Mg. For two distinct points x,y £ A/5IJA/5, let 
C{x, y) denote the unit circle spanned by x, y. We construct 5^-net Cg 2 {x, y) for each circle C{x, y). 
For simplicity, we just let Cg 2 {x,y) be the set of points that uniforirrly split C{x,y) with interval 
5^. We thus have \Cg 2 {x,y)\ < Let Gg denote the union of all circle nets Cg 2 {x,y) spanned by 
points in A/'^IJ-^5) namely 

gg:= IJ Cg 2 {x,y)U{x,y}. 

V x,y&Ng U-+5 

For any point x £ K, we can always find a point in Gs that is 0(5^) away from x. To see 
why the argument is true, we hrst let Xi be the nearest point to x in Ms. If \\x — Xih < 
then $x_1$ is the point we want. Otherwise, we have $\delta^2 \le \|x - x_1\|_2 \le \delta$. In this case, we have
{x — Xi)/\\x — iCill G K~^. Following the dehnition of K'^^ we can always hnd a point x'^ G U-^(5 
such that 

rr — ry»i 

(5.16) 


I , X — x^ II 

\x\ - - - —\\^ < 6, 


thereby 


\x - [\\x 


\\X-Xi\\2' 

- Xi\\ 2 x'i + Xi) II 2 < 5\\x - Xi\\2 < 5^. 


Note that ||2;||2 is very close to I because 

> II® — ^ ll^lli “ 2^z,a;) + I > ||z||| — 2 ||z||2 + I = (||2:||2 — l)^. 

We thus have 

ll® - Z/IIZII 2 II 2 < II® - 2:||2 + ll^: - Z/IIZII 2 II 2 = II® - 2:||2 + |||2:||2 - l| < 

Note that z is in the unit circle C(x, x[) spanned by x and x[, thereby there exists u G C^ 2 {x\,x']) 
such that ||u — ® II2 < Point u thus satisfies 

II® — i^ll < II® — 2:||2 + ||z — ri II 2 < 35^. (5.17) 

So for any x £ K and its nearest point Xi G A/^, we dehne u as 

{ ®i, \\x - xi \\2 < 

argmin^gc,2(a;i,a;'i) “ '^Il2> otherwise. 

where x'^ G A/5UA/5 and satisfies (5.16). Based on (5.17), we always have ||ii — x II2 < 3(5^ and 
llw — ®l||2 < llw — ®||2 + II® — ®l||2 < 2(5. 

By triangle inequality of Hamming distance, 

(iA(®, ®i) < hA(®, u) + d^iu, xi). 


We thus have 


T < sup sup dA{x,u) + dA.{u,xi) 

xi&Ms x&K 

II2 


< sup 
uGQs 


sup dA{x,u) + 

x&K 

a:—tt||2<3(5^ 


Ti 


sup sup dA{u,v) . 

x,y&N's[jM'^ u,vec{x,y) 

\\u—v\\2^25 


T2 


Next we bound term Ti and T 2 respectively. 

Term Ti. For a fixed point u G Gs, using Lemma 5.7 by setting {K,t) in the statement to be 
K' = {K — {u}) n{'*^ ^ • ll'^^lb < 3(5^} and 5^ respectively yields that 


Pr 


sup 

x£K 

\\x—u\\2<3S^ 

m5‘^ 


m ' ' 

2=1 

^ - 2exp(-Wl8)- 


® - U \\2 


>L1£1 + ^ 


m 


24 











Then with probability greater than 1 — 2exp(—m/18), 

1 Pi 

sup iSKA...-„)|<3W± + ^w{K')/y/m + < 5(5 


x^K ^ 
\\x — u\\2<SS‘^ 




where the last inequality follows from the fact that w{K') < w{K) and our assumption m 
w{Kp/5^. We define event 

f 1 ™ 

£ ■.= I sup sup —y) I (Aj, X — u)\ < 5(5^ 

I u&Qs x&K ^ 

llaj— ia||2<35^ 

Applying union bound over all points in Qs, we have 

Vx{£‘^) < 2|^5| exp(—m/18) < 2exp(—m/36), 

where the last inequality holds with m > log {Gsl- Under condition event £ happens, we have 


> 

('N_/ 


sup sup 
u&Qs x&K xn 
lltc—it||2<3(5^ 


m , . 


(5.18) 


If sign ((Aj, m)) / sign ((Aj, *)), we must have |^Aj,rt)| < |^Aj,M —a;)|. We then have 

1 ™ f 

Ti < sup sup — yy 1< |(^Aj,u)| < |^Aj,rr — a) 
u&Gs x&K xn [ 

\\x—u\\2<S5'^ 

m . . 

< sup — X] 1 ^ I (Aj, rr) I < 5(5 > + (5, 
ugGs ^ I J 

where the last inequality follows from (5.18). Using Lemma 5.7 by setting K and 5 in the statement 
to be Gs and 5(5 respectively, we have that, when m > c^log|^5| with some absolute constant c, 
the following inequality 

m . . 

-El IA..«>l<5d<ioj 


sup 

u&Gs ^ 


i=l 


holds with probability at least 1 — exp(—255^m). Putting all ingredients together, we have Ti < 115 
with high probability. 

Term T 2 . There are at most lAZ/IJA/'^p different two-dimensional subspaces constructed from 
A/IJA/"^. Applying Lemma 5.6 and probabilistic union bound over all subspaces yields that 

Pr ^^2 > {C + 2)5^ < 3|A// exp(—5^m) < 3exp(—5^m/2), 

where the last inequality holds by setting m > ^ log |A// U A/"^!. 

Putting (5.15) and the upper bounds of term T together, we conclude that by choosing 


m> maxi w{Kf/6^, logj^^l, ^ log |A// |J A/(| 


25 


we have 


sup \dA{x,y) - d{x,y)\ <6. 

x,y^K 


with probability at least 1 — ci exp(—C25^?7i) where ci,C 2 are some absolute constants. 
Using the fact that 




and 


we complete the proof. 


(52 


log|A4ljAA^| < 


□ 


A Proof of Lemma 3.6 

Proof. Recall that yi = ' Xi- We let 


Ui ^ yj 
Vi = PT-r^yj = —^ 


Wm 


Hh 


From condition (3.7), we have 


\yi - mh < < 5 ) Wvj - Vjh < 


(A.l) 


Let 9 = Z{xi,Xj), O' = Z(yi,yj). Without loss of generality, we assume our set K = 
is symmetric, i.e., if £C £ AT then —x £ K. Suppose we show for any two points Xi,Xj with 
{xi,Xj'^ > 0, inequality (3.8) holds, then for Xi,Xj with {xi,Xj'^ < 0, we immediately have 

\d{yi,yj) - d{xi,Xj)\ = |l - d{yi,yj) - (l - d{xi,Xj)) \ = \d{-yi,yj) - d{-Xi,Xj)\ < C6. 


In the second equality, we use d{—x,y) + d{x,y) = 1, x,y £ In the last inequality, we use 

the fact that fast JL transform is linear thus —yi = ^J^^{Q){—Xi). Therefore, without 

loss of generality, we assume (ajj, Xj) > 0 thus 0 < ^. 

Now we turn to the following quantity 

11^* “ = 11^* -Vi + Vi- Vj +yj - yj\\2 

< ll^i - yi\\2 + ll^j - yj\\2 +11^* “ yj\\2 ^ 25 + \\xi - Xj\\2{i + (5). 

The last inequality follows from (A.l) and condition (3.6). Similarly, we also have 

\\yi - yj \\2 > \\xi - Xj\\{l -6) - 26. 

Using the fact that 


. O' Wyi-yjWo . 0 

sm — - - , sm - - 


I Xi Xj I j 2 


we have 


sm-sm — = 

2 2 ' 


\yi - yj\ 


Xi - X 


^1 < 5 + (5' 


\Xi — X 


P\2 


< 26. 
When $\delta <$


we have 


0' 6» V3-\/2 \/3 

sin — < sin — I-< -. 


In the last inequality, we use sin | < € [0,7r/2]. So 0'/2 G [0, tt/S]. Using the fact that, for 

any two 0,0' G [0, vr/S], there exists constant c such that 

I sin 0 — sin 0'\ > c\0 — 0'\, 


we have that 


Therefore, 


\0 

O' \ 

1, 

0' 


^ 2h 



< - 

sm — 

• 

— sm — 

< — 

'2 ~ 

2 1 

c 

2 

2' 

c 


\d{yi,yj) - d{xi,Xj)\ = -\0 - 0'\ < C6. 

TT 

In the case 6 > trivially we have \d{yi,yj) — d{xi,Xj)\ < 2 < C6 with constant C = 

_8_ □ 

v/3-U2- 

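B Proof of Lemma 5.4

Proof. For a positive semidefinite matrix $M \in \mathbb{R}^{p \times p}$ with rank $d$, let $\lambda_1, \lambda_2, \ldots, \lambda_d$ be its positive
eigenvalues. Using the definition of the Frobenius norm, we have

$$\sum_{i=1}^{d} \lambda_i^2 = \|M\|_F^2 = \sum_{i, j \in [p]} M_{ij}^2 \le p + p(p - 1)\delta_2^2. \tag{B.1}$$

On the other hand, considering the trace of $M$, we can obtain

$$\sum_{i=1}^{d} \lambda_i = \mathrm{Trace}(M) \ge p(1 - \delta_1). \tag{B.2}$$

Using the Cauchy-Schwarz inequality, we have

$$\Big( \sum_{i=1}^{d} \lambda_i \Big)^2 \le d \sum_{i=1}^{d} \lambda_i^2.$$

Plugging (B.1) and (B.2) into the above inequality yields

$$d \ge \frac{p(1 - \delta_1)^2}{1 + (p - 1)\delta_2^2}.$$

□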
C Proof of Lemma 5.6


Proof. Without loss of any generality, we assume K = {x & : supp(a;) C {1,2}}. We begin 

with constructing a (i-net denoted as Afs for set K. For simplicity, we can just let Ms{K) be the set 
of points that split the circle spanned by { 61 , 62 } uniformly. Therefore \Afs{K)\ = 0{^). Applying 
Proposition 2.2 gives us 

sup \dA{x,y) - d{x,y)\ < 6, (C.l) 

x,yeJ\fs 

holds with probability at least 1 — 2 exp {—5‘^m) when rn > log(j). 

For any point x £ K, only depends on the first two coordinates of Aj. Therefore, for 

simplicity, we let A^ = , V i G [m]. For any point say Xi G Afs, using the uniform 

distribution of A^, we have 

Pr(|(A',a;i>|<5)<C5, 

holds with some absolute constant C. Using Hoeffding’s inequality and probabilistic union bound 
over all points in Afs, we have 


Pr 


- //t 

x&Ms rn 


< lA/^l exp(—25^m) < exp(—(5^m). 


(C.2) 


The last inequality holds when m > ^ log ^. 

Now we consider any point x £ K. Suppose Xi is the closest point to x in Afs- We note that if 
sign ((A', x)) / sign (( A', ®i)), then there exists A G [0,1] such that 


(A', A® + (1 — A)a;i) = 0. 


We thus have 

|(A',a;i)| = A|(A',a; - a:i)| < A||a: - ®i ||2 < 5. 


Further we obtain that 


^ / / 1 

sup sup dA{x,xi) = sup sup —7 l(sign ((a', a;)) / sign ((a', aji))) 

x^K xiGMs xGK ^ 

II®—®l||2<^ 

^ m 

xi&Ns 


Combining the above result with (C.2), we obtain that, with probability at least 1 — exp(—5^m), 


sup sup dA{x,xi) < {C +1)6. (C.3) 

xi&Ms x&K 

II®—®! ||2^^ 
For any points $x, y \in K$, let $x_1, y_1$ be their nearest points in $\mathcal{N}_\delta$. We have

$$\begin{aligned}
|d(x, y) - d_A(x, y)| &\le |d(x, y) - d(x_1, y_1)| + |d(x_1, y_1) - d_A(x, y)| \\
&\overset{(a)}{\le} |d(x_1, y_1) - d_A(x, y)| + 2\delta \le |d(x_1, y_1) - d_A(x_1, y_1)| + |d_A(x_1, y_1) - d_A(x, y)| + 2\delta \\
&\overset{(b)}{\le} |d_A(x_1, y_1) - d_A(x, y)| + 3\delta \le |d_A(x_1, y_1) - d_A(x_1, y)| + |d_A(x_1, y) - d_A(x, y)| + 3\delta \\
&\overset{(c)}{\le} d_A(y_1, y) + d_A(x_1, x) + 3\delta \overset{(d)}{\le} (2C + 5)\delta,
\end{aligned}$$

where (a) follows from

$$|d(x, y) - d(x_1, y_1)| \le |d(x, y) - d(x_1, y)| + |d(x_1, y) - d(x_1, y_1)| \le d(x, x_1) + d(y, y_1) \le 2\delta,$$

step (b) follows from (C.1), step (c) follows from the triangle inequality of Hamming distance, and step
(d) is from (C.3). □